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Abstract. We consider the problem of doing fast and reUable estimation 
of the number z of non-zero entries in a sparse boolean matrix product. 
This problem has applications in databases and computer algebra. 
Let n denote the total number of non-zero entries in the input matri- 
ces. We show how to compute a 1 ± e approximation of z (with small 
probability of error) in expected time 0(n) for any e > 4/ The pre- 
viously best estimation algorithm, due to Cohen (JCSS 1997), uses time 
0{n/e^). We also present a variant using £'(sort(n)) I/Os in expectation 
in the cache-oblivious model. 

In contrast to these results, the currently best algorithms for computing a 
sparse boolean matrix product use time u}{;n:^''^) (resp. Lo{n^^'^/B) I/Os), 
even if the result matrix has only z = 0{ti) nonzero entries. 
Our algorithm combines the size estimation technique of Bar-Yossef et 
al. (RANDOM 2002) with a particular class of pairwise independent hash 
functions that allows the sketch of a set of the form ^ x C to be computed 
in expected time 0(1^1 + \C\) and ©(sortd^l + \C\)) I/Os. 
We then describe how sampling can be used to maintain (independent) 
sketches of matrices that allow estimation to be performed in time o(n) 
if z is sufficiently large. This gives a simpler alternative to the sketching 
technique of Ganguly ct al. (PODS 2005), and matches a space lower 
bound shown in that paper. 

Finally, we present experiments on real-world data sets that show the 
accuracy of both our methods to be significantly better than the worst- 
case analysis predicts. 

1 Introduction 

In this paper we will consider a, d x d boolean matrix as the subset of [d] x [d\ 
corresponding to the nonzero entries. The product of two matrices Ri and i?2 
contains {i, k) if and only if there exists j such that € Ri and (j, k) G i?2- 
The matrix product can also be expressed using basic operators of relational 
algebra: Ri M i?2 denotes the set of tuples (z, j, k) where (i, j) € -Ri and (j, k) G 
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i?2, and the projection operator n can be used to compute the tuples (i, k) where 
there exists a tuple of the form (i, •, fc) in i?i M i?2- Since most of our applications 
are in database systems we will primarily use the notation of relational algebra. 

We consider the following question: given relations Ri and R2 with schemas 
(a, 6) and (6, c), estimate the number z of distinct tuples in the relation Z = 
T^ac{Ri i?2)- This problem has been referred to in the literature as join-project 
or join-distinc^ We define ni = \Ri\, n2 = I-R2I, and n = ui + n2- As observed 
above, the join-project problem is equivalent to the problem of estimating the 
number of non-zero entries in the product of two boolean matrices, having ni 
and 712 non-zero entries, respectively. 

In recent years there has been several papers presenting new algorithms for 
sparse matrix multiplication |3|13|15j . In particular, these algorithms can be used 
to implement boolean matrix multiplication. However, the proposed algorithms 
all have substantially superlinear time complexity in the input size n: On worst- 
case inputs they require time Loiji^/'^), even when z = 0{n). 

In an influential work, Cohen [5] presented an estimation algorithm that, 
for any constant error probability (5 > 0, and any e > 0, can compute a 1 ± e 
approximation of 2 — \Z\ in time 0{n/e^). Cohen's algorithm applies to the 
more general problem of computing the size of the transitive closure of a graph. 

Our main result is that in the special case of sparse matrix product size 
estimation, we can improve this to expected time 0{n) for e > 4/-^. This 
means that we have a linear time algorithm for relative error where Cohen's 
algorithm would use time 0{n^/z). 

Approach. To build intuition on the size estimation question, consider the sets 
Aj = {i \{i,j) e Ri} and C-j = {k \{j,k) £ R2}. By definition, Z = Uj A' ^ 
Cj. The size of Z depends crucially on the extent of overlap among the sets 
{Aj X Cj}j. However, the total size of these sets may be much larger than both 
input and output (see [5]), so any approach that explicitly processes them is 
unattractive. 

The starting point for our improved estimation algorithm is a well-known 
algorithm for estimating the number of distinct elements in a data streaming 
context [4j. (We remark that the idea underlying this algorithm is similar to 
that of Cohen [B].) Our main insight is that this algorithm can be extended 
such that a set of the form Aj x Cj can be added to the sketch in expected 
time Od-Ajl -|- \Cj\), i.e., without explicitly generating all pairs. The idea is to 
use a hash function that is particularly well suited for the purpose: sufficiently 
structured to make hash values easy to handle algorithmically, and sufficiently 
random to make the analysis of sketching accuracy go through. 



^ Readers familiar with the database literature may notice that we consider projections 
that return a set, i.e., that projection is duplicate eliminating. We also observe that 
any equi-join followed by a projection can be reduced to the case above, having two 
variables in each relation and projecting away the single join attribute. Thus, there 
is no loss of generality in considering this minimal case. 



1.1 Motivation 



Cohen [7] investigated ttie use of the size estimation technique in sparse matrix 
computations. In particular, it can be used to find the optimal order of multiply- 
ing sparse matrices, and in memory allocation for sparse matrix computations. 

In addition, we are motivated by applications in database systems, where size 
estimation is an important part of query optimization. Examples of database 
queries that correspond to boolean matrix products are: 

— A query that computes all pairs of people in a social network with a distance 
2 connection ("possible friends"). 

— A query to compute all director- actor pairs who have done at least one movie 
together. 

— In a business database with information on orders, and a categorization 
of products into types, compute the relation that contains a tuple {c,p) if 
customer c has made an order for a product of type p. 

As a final example, we consider a fundamental data mining task. Given a list 
of sets, the famous Apriori data mining algorithm 2J finds frequent item pairs 
by counting the number occurrences of item pairs where each single element 
is frequent. So if Ri = i?2 denotes the relationship between high-support (i.e., 
frequent) items and sets in which they occur, Z is exactly the pairs of frequent 
items, and the number of distinct items in Z determines the space usage of 
Apriori. Since Apriori may be very time consuming, it is of interest to establish 
whether sufficient space is available before choosing the support threshold and 
running the algorithm. 

1.2 Further related work 

JD sketch. Ganguly et al. [9] previously considered techniques that compute a 
data structure (a sketch) for Ri and R2 (individually), such that the two sketches 
suffice to compute an approximation of z. 

Define Ua = \{i \ 3j.(i,j) £ Ri}\ and = \{k \ 3j.{j,k) G i?2}|. Ganguly 
et al. show that for any constant c and any /3, a sketching method that returns 
a c-approximation with probability J7(l) whenever z > (5 must, on a worst-case 
input, use expected space 

n{mm{ni+n2,nanc{ni/na+n2/nc)/l3)) = n{mm{ni+n2,{ninc+n2na)/(3)) bits. 

The lower bound proof applies to the case where ni = n2, Ua — ric, and z < 
ria + n-c- We note that [5] claims a stronger lower bound, but their proof does 
not establish a lower bound above ni -I- 712 bits. Ganguly et al. present a sketch 
whose worst-case space usage matches the lower bound times polylogarithmic 
factors (while not stated in [3] , the trivial sketch that stores the whole input can 
be used to nearly match the first term in the minimum). 

In Section [3] we analyze a simple sketch, previously considered in other con- 
texts by Gibbons [2] and Ganguly and Saha [TU] . It similarly matches the above 
worst-case bound, but the exact space usage is incomparable to that of [9] . 



The focus of ^ is on space usage, and so the time for updating sketches, 
and for computing the estimate from two sketches, is not discussed in the paper. 
Looking at the data structure description we see that the update time grows 
hnearly with the quantity si, which is f2{n) in the worst case. Also, the sketch 
uses a number of summary data structures that are accessed in a random fashion, 
meaning that the worst case number of I/Os is at least f2{n) unless the sketch 
fits internal memory. By the above lower bound we see that keeping the sketch 
in internal memory is not feasible in general. In contrast, the sketch we consider 
allows collection and combination of sketches to be done efficiently in linear time 
and I/O. 

Distinct elements and distinct paths estimation. Our work is related in 
terms of techniques to papers on estimating the number of distinct items in a 
data stream (see [4] and its references). However, our basic estimation algorithm 
does not work in a general streaming model, since it crucially needs the ability 
to access all tuples with a particular value on the join attribute together. 

Ganguly and Saha [T^ consider the problem of estimating the number of 
distinct vertex pairs connected by a length-2 path in a graph whose edges are 
given as a data stream of n edges. This corresponds to size estimation for the 
special case of squaring a matrix (or self-join in database terminology). It is 
shown that space i/n is required, and that space roughly 0{n^^^) suffices for 
constant e (unless there are close to n connected components). The estimation 
itself is a join-distinct size estimation of a sample of the input having size no 
smaller than 0{n^^'^ /e'^). Using Cohen's estimation algorithm this would require 
time C'(n^/''/e'*), so this is 0{n) time only for e > 1/ ^-^/n. 

Join synopses. Acharya et al. 1 proposed so-called join synopses that provide 
a uniform sample of the result of a join. While this can be used to estimate result 
sizes of a variety of operations, it does not seem to yield efficient estimates of 
join-project sizes. The reason is that a standard uniform sample is known to be 
inefficient for estimating the number of distinct values [S]. In addition, Acharya 
et al. assume the presence of a foreign- key relationship, i.e., that each tuple has 
at most one matching tuple in the other table(s), which is also known as a snow 
flake schema. Our method has no such restriction. 

Distinct sampling. Gibbons [TT] considered different samples that can be 
extracted by a scan over the input, and proposed distinct samples, which offer 
much better guarantees with respect to estimating the number of distinct values 
in query results. Gibbons shows that this technique applies to single relations, 
and to foreign key joins where the join result has the same number of tuples 
as one of the relations. In Section |3] we show that the distinct samples, with 
suitable settings of parameters, can often be used in our setting to get an accurate 
estimate oi z — \Z\. The processing of a pair of samples to produce the estimate 
consists of running the efficient estimation algorithm of Section[2]on the samples, 
meaning that this is time- and I/O-efficient. 



2 Our algorithm 



The task is to estimate the size z of Z = Trac{Ri ^ R2)- We may assume that 
attribute values are ©(log n)-bits integers, since any domain can be mapped 
into this one using hashing, without changing the join result size with high 
probability. When discussing I/O bounds, B is the number of such integers that 
fits in a disk block. In linear expected time (by hashing) or sort(n) I/Os we can 
cluster the relations according to the value of the join attribute b. By initially 
eliminating input tuples that do not have any matching tuples in the other 
relation we may assume without loss of generality that z > n/2. 

In what follows. A: is a positive integer parameter that determines the space 
usage and accuracy of our method. The technique used is to compute the fcth 
smallest value v of a hash function h{x,y), for {x,y) G Z. Analogously to the 
result by Bar-Yossef et al. [4] we can then use z = fc/w as an estimator for z. 

Our main building block is an efficient iteration over all tuples {x,-,y) G 
Ri 1X1 i?2 for which h{x,y) is smaller than a carefully chosen threshold p, and is 
therefore a candidate for being among the k smallest hash values. The essence 
of our result lies in how the pairs being output by this iteration are computed 
in expected linear time. We also introduce a new buffering trick to update the 
sketch in expected amortized 0(1) time per pair. In a nutshell, each time k 
new elements have been retrieved, they are merged using a linear time selection 
procedure with the previous k smallest values to produce a new (unordered) list 
of the k smallest values. 

Theorem 1. Let i?i(a, 6) and R2{b,c) be relations with n tuples in total, and 
define z = \T^ac{Ri ^ R2)\- Let £, < e < ^ &e given. There are algorithms 
that run in expected 0{n) time on a RAM, and expected O{sort{n)) I/Os in the 
cache- oblivious model, and output a number z such that for k — 

- Pr[(l - e)z < z < (1 + e)z] > 2/3 when z > fc^, and 

- Pr[z < (1 + e)P] > 2/3 when z <P. 

Observe that for e > 4/-^z we will be in the first case, and get the desired 1 ± e 
approximation with probability 2/3. The error probability can be reduced from 
1/3 to S by the standard technique of doing 0{\og{l/5)) runs and taking the 
median (the analysis follows from a Chernoff bound). We remark that this can 
be done in such a way that the 0(log(l/(5)) factor affects only the RAM running 
time and not the number of I/Os. For constant relative error e > we have the 
following result: 

Theorem 2. In the setting of Theorem^ if e is constant there are algorithms 
that run in expected 0{n) time on a RAM, and expected O{sort{n)) I/Os in the 
cache- oblivious model, that output z such that Pr[(l — e)z < z < (1 + e)z] = 
l-Oil/Vn). 

The error probability can be reduced to n""^ for any desired constant c by running 
the algorithms 0{c) times, and taking the median as above. 



2.1 Finding pairs 



For B = ■Kb{Ri) U 7rfc(i?2) and each i € i3 let Ai — TTa{'^b=i{Ri)) and Ci = 
T^c{o'b=i{R2))- We would like to efficiently iterate over all pairs {x,y) e Ai x d, 
i ^ B, for which h{x,y) is smaller than a threshold p. This is done as follows 
(see Algorithm [l] for pseudocode). 

For a set U, let hi, h2 : U [0; 1] be hash functions chosen independently at 
random from a pairwise independent family, and define h : U x U ^ [0; 1] bjj^ 

h{x,y) = {hi{x) - h2{y)) mod 1. 

It is easy to show that h is also a pairwise independent hash function — a 
property we will utilize later. Now, conceptually arrange the values of h{x,y) in 
an \Ai\ X \Ci\ matrix, and order the rows by increasing values of hi{x), and the 
columns by increasing values of h2{y)- Then the values of h{x,y) will decrease 
(modulo 1) from left to right, and increase (modulo 1) from top to bottom. 

For each i G B, we traverse the corresponding \Ai\ x \Ci\ matrix by visiting 
the columns from left to right, and in each column t finding the row s with the 
smallest value of h{xs,yt)- Values smaller than p in that column will be found 
in rows subsequent to s. When all such values have been output, the search 
proceeds in column t + 1. Notice, that if h{xg,yt) was the minimum value in 
column t, then the minimum value in column t + 1 is found by increasing s until 
h{xs, yt+i) < /i(a:(s_i) mod \Ai\j yt+i)- We observe that the algorithm is robust to 
decreasing the value of the threshold p during execution, in the sense that the 
algorithm still outputs all pairs with hash value at most p. 



2.2 Estimating the size 

While finding the relevant pairs, we will use a technique that allows us to main- 
tain the k smallest hash values in an unordered buffer instead of using a heap 
data structure (lines 14 -18 in Algorithm [T]) . In this way we are able to maintain 



the k smallest hash values in constant amortized time per insertion in the buffer, 
eliminating the log k factor implied by the heap data structure. 

Let S and F be two unordered sets containing, respectively, the k smallest 
hash values seen so far (all, of course, smaller than p), and the latest up to k 
elements seen. We avoid duplicates in 5" and F (i.e., the sets are kept disjoint) by 
using a simple hash table to check for membership before insertion. Whenever 
\F\ = k the two sets S and F are combined in order to obtain a new sketch S. 
This is done by finding the median of SU F, which takes 0{k) time using either 
deterministic methods (see [8]) or more practical randomized ones [T2j . 

At each iteration the current A:th smallest value in S may be smaller than 
the initial value p, and we use this as a better substitute for the initial value 
of p. However, in the analysis below we will upper bound both the running time 
and the error probability using the initial threshold value p. 



We observe that this is different from the "composable hash functions" used by 
Ganguly at al. [9]. 



Algorithm 1 Pseudocode for the size estimator. 



1: procedure DisItems(p, e) 

2: [g/e^l 

3: 

4: for i G 5 do 

5: X -i^ Ai sorted according to /ii-value 

6: y Ci sorted according to /i2-value 

7: s ^ 1 

8: for t := 1 to \Ci\ do 

9: while h{xs,yt) > h{x^s-i) mod \A,\,yt) do > Find s s.t. h{xs,yt) is min. 

10: s <- (s + 1) mod 1A| 

11: end while 

12: s ^ s 

13: while h{xs,yt) < p do [> Find all s where h{xs,yt) < p 

14: F^FU{(x,,yt)} 

15: if |F| — k then > Buffer filled, find smallest hash values in S* U f 

16: (p, S) ^ Combine(S', F) 

17: 

18: end if 

19: s ^ (s + 1) mod \ Ai\ 

20: end while 

21: end for 

22: end for 

23: {p, S) ^ Combine(S', F) 

24: if IS] = fc then 

25: return "2 = ^ and 5 G [(1 ± e)^] with probability 2/3" 

26: else 

27: return "z = fc^, 2 < fc^ with probability 2/3" 

28: end if 

29: end procedure 

30: procedure Combine(S, F) 

31: ?; <— Rank(/i(S') U fc) > Rank(-, fc) returns the fcth smallest value 

32: S ^ {x e SU F\h{x) <v} 

33: return («, S) 
34: end procedure 



2.3 Time analysis 

We split the time analysis into two parts. One part accounts for iterations of 
the inner while loop in lines [13^201 and the other part accounts for everything 
else. We first consider the RAM model, and then outline the analysis in the 
cache-oblivious model. 

Inner while loop. Observe that for each iteration, one pair (xg, yt) is added to F 
(if it is not already there). For each t £ d, p\Ai\ elements are expected to 
be added since each pair {xs,yt) is added with probability p. This means that 
the expected total number of iterations is ©(pIvAiUCij). Each call to Combine 



costs time 0{k), but we notice that there must be at least k iterations between 
successive calls, since the size of F must go from to k. Inserting a new value 
into F costs 0(1) since the set is not sorted. Hence, the total cost of the inner 
loop is 0{p\A^\\C^\). 

Remaining cost. Consider the processing of a single i G B in Algorithm [T] The 
initial sorting of hash values can be done with bucket sort requiring expected 
time Od^il + \Ci\) since the numbers sorted are pairwise independent (by the 
same analysis as for hashing with chaining). 

For the iteration in lines [9 11 observe that h{xs,yt) is monotone modulo 1, 



and we have at most a total of 2\Ai\ increments of s among all t £ Ci. Thus, 
the total number of iterations is 0(1^^1), and the total cost for each i G B is 
Oi\A\ + \C,\). 

The time for the final call to Combine is dominated by the preceding cost 
of constructing S and F. 

I/O efficient variant. As for I/O efficiency, notice that a direct implementation 
of Algorithm [l] may cause a linear number of cache misses if Ai and Ci do 
not fit into internal memory. To get an I/O-efficient variant we use a cache- 
oblivious sorting algorithm, sorting i?i according to {b, hi{a)), and i?2 according 
to (6, /12(c)), such that the sorting steps for each i G B is replaced by one global 
sorting step. 

The rest of the algorithm works directly in a cache-oblivious setting. To see 
this, notice that it suffices to keep in internal memory the two input blocks that 
are closest to each of the pointers s, t, and s. The cache-oblivious model assumes 
the cache to behave in an optimal fashion, so also in this model there will be 
f2{B) operations between cache misses, and 0{n/B) I/Os, expected, in total. 

Lemma 1. Suppose Ri{a,b) and i?2(fo, c) are relations with n tuples in total. 
Let p > and e > be given. Then Algorithm [7] runs in expected 0{n + 
'^iP\Ai\\Ci\) time and 0{l/e'^) space on a RAM, and can be modified to use 
expected O{sort[n)) I/Os in the cache- oblivious model. 



Choice of threshold p. We would like a value of p that ensures the expected 
processing time is 0{n). At the same time p should be large enough that we 



expect to reach line 25 where an exact estimate is returned (except possibly in 



the case where z is small). 

Lemma 2. Let j G B satisfy \Ai\\C^\ < \Aj\\Cj\ for all i G B. Then p = 
min(l/fc, fc/(|^j| |Cj I)) gives an expected 0{n) running time for Algorithm^ 

Proof. We argue that for each i, p\Ai\\Ci\ < max(|^i|, \Ci\), which by Lemmajl] 
implies running time Oin + Y^^p\A^\\Ci\) = 0{n + J^.max{\Ai\,\Ci\)) = 0{n). 
Suppose first that \A^\\Ci\ > P. Then p = k/{\Aj\\Cj\) and p|yli||C,| < fc < 
A/|^i||Ci| < max(|^i|, \Ci\). Otherwise, when |^i||Ci| < fc^, we have p < 1/fc and 
p|A||C.| = \A\m/k < max(|^,|. Id). □ 



We note that when Ri and R2 are sorted according to b, the value of p 
specified above can be found by a simple scan over both inputs. Our experiments 
indicate that in practice this initial scan is not needed, see Section |4] for details. 



2.4 Error probability 

Theorem 3. Let h be a pairwise independent hash function. Suppose we are 
provided with a stream of elements N with h{x) < v for all x N . Further, let e, 
< e < I be given and assume that p > min where k > , and z is 

the number of distinct items in N. Then Algorithm^produces an approximation 
z of z such that 

- Pr[(f - £)z < z < (i + e)z] > 2/3 for z > k"^ , and 

- Pr[5 < (1 + e)fc2] > 2/3 for z < k'^ . 

Proof. The error probability proof is similar to the one that can be found in 0] , 
with some differences and extensions. We bound the error probability of three 
cases: the estimate being smaller/larger than the multiplicative error bound, and 
the number of obtained samples being too small. 

Estimate too large. Let us first consider the case where z > {l + e)z, i.e. the 
algorithm overestimates the number of distinct elements. This happens if the 
stream N contains at least k entries smaller than fc/(l + e)z. For each pair 
(a, c) € Z define an indicator random variable ^(a,c) ^ 



1 h{a, c) < fc/(l + e)z 
otherwise 



That is, we have z such random variables for which the probability of ^(a,c) = 1 
is exactly k/ {l + e)z and E[X(a c)] — k/{l + e)z. Now define Y — J2{a c)ez ^{a,c) 
so that E[r] = I][Eia,c)ezX'ia,c)] = E(ax)6Z E[^(a,c)] = k/{l+'e). By the 
pairwise independence of the -^(a,c) we also get Var(y) < fc/(l + e). Using 
Chebyshev's inequality |14| we can bound the probability of having too many 
pairs reported: 



Pr \Y > k] < Pr 



|y-E[r]| >fc-^ 



^ Var[y] ^ fc/(l + £) ^ 1 

/ ^ \2 — / ^ \ 2 — 6 



since k > 9/e^. 

Estimate too small. Now, consider the case where z < (1 — e)z which hap- 
pens when at most k hash values are smaller than fc/(l — e)z and at least k hash 
values are smaller than p. Define X', ^ as 



1 h{a, c) < fc/(l - e)z 
otherwise 



so that E[X[^^^)] = k/{l~e)z < (l+e)k/z. Moreover, with Y' = Y.(a.c)&z ^a,c) 
we have E[F'] — kj (1— e), and since the indicator random variables defined above 
are pairwise independent, we also have Var[y] < E[y] < (l + e)fc. Chebyshev's 
inequality gives: 



Pr [r' < fc] < Pr 



|y'-Em>^ 



^ Var[r] ^ (l + g)fc ^ 



since /c > 

A'^ot enough samples. Consider the case where jS*! < after all pairs have 
been retrieved. In this case the algorithm returns /S — k'^ a.s an upper bound 
on the number of distinct elements in the output, and we have two possible 
situations: either there is actually less than k^ distinct pairs in the output, in 
which case the algorithm is correct, or there are more than k'^ distinct elements 
in the output, in which case it is incorrect. In the latter case, less than k hash 
values have been smaller than p and the fcth smallest value v is therefore larger 
than p. Define X'/ ■, as 

^ (a,c) 



Y" 



1 h{a, c) < p 
otherwise 



and let again Y" = J2{ac)ez ^['a c)- resuhs that E[X('^ c)^ = P and E[Y"] = zp, 
and because of pairwise independancy of X'^'^ also Var[y"] < E[y"]. Using 
Chebyshev's inequality and remembering that z > fc^ in this case we have: 

Pr[r" < A:] < Pr[|y" - E[r"]l > zp - k] < , < < 8/k < 1/18. 

[zp-kr [\zp) 

using that A: > O/e^ > 144. 

In conclusion, the probability that the algorithm fails to output an estimate 
within the given limits is at most 1/6+1/9+1/18 = 1/3. □ 

For the proof of Theorem[2]we observe that in the above proof, if e is constant 
the error probability is 0{\/k). Using k = ^Jn we get linear running time and 
error probability O^Vj ^fn). 



Realization of hash functions. We have used the idealized assumption that 
hash values were real numbers in (0; 1). Let m — v? . To get an actual implemen- 
tation we approximate (by rounding down) the real numbers used by rational 
numbers of the form i/m, for integer i. This changes each hash value by at most 
2/m. Now, because of the way hash values are computed, the probability that 
we get a different result when comparing two real-valued hash values and two 
rational ones is bounded by 2/m. Similarly, the probability that we get a differ- 
ent result when looking up a hash value in the dictionary is bounded by 2k /m. 
Thus, the probability that the algorithm makes a different decision based on the 
approximation, in any of its steps, is 0{kn/m) = o(l). Also, for the final output 
the error introduced by rounding is negligible. 



3 Distinct sketches 



A well-known approach to size estimation in, described in generality by Gib- 
bons [11] and explicitly for join-project operations in |10l3j . is to sample random 
subsets R'l C Ri and R'2 Q R2, compute Z' — nac{Ri ^ R'2), and use the size 
of Z' to derive an estimate for z. This is possible if R'l = (Taes„(i?i), where 
Sa Q i^a{Ri) is a random subset where each element is picked independently 
with probability pi, and similarly i?2 = '^a^sA^'i), where Sc Q 7rc(i?2) includes 
each element independently with probability p2- Then z' = \Z'\/{piP2) is an 
unbiased estimator for z. The samples can be obtained in small space using hash 
functions whose values determine which elements are picked for Sa and Sc- The 
value \Z'\ can be approximated in linear time using the method described in sec- 
tion [2] if the samples are sorted — otherwise one has to add the cost of sorting. 
In cither case, the estimation algorithm is I/O-efhcient. 

Below we analyze the variance of the estimator z', to identify the minimum 
sampling probability that introduces only a small relative error with good prob- 
ability. The usual technique of repetition can be used to reduce the error prob- 
ability. Recall that we have two relations with rii and n2 tuples, respectively, 
and that Ua and ric denotes the number of distinct values of attributes a and c, 
respectively. Our method will pick samples R'l and R'2 of expected size s from 
each relation, where s = pinii = p2f^2 is a parameter to be specified. 

Theorem 4. Let R[ and R'2 be samples of size s, obtained as described above. 
Then z' — |7rac(-R'i 1^ ^2)l/(?'iP2) is a l±e approximation of z = \'Kac{Ri ^ -^2)! 
with probability 5/6 if z > (5, where /3 = ^ ( "■^"1+"°"^ ). If z < (3 then z' < 
(1 -|- e)/3 with probability 5/6. 



3.1 Analysis of variance 



To arrive at a sufficient condition that z' is a 1 ± e approximation of z with good 
probability, we analyze its variance. To this end define Zi. = {j \ {i.,j) € Z}, 
Z.j = {i I (ij) e Z}, and let 

f 1 -pi, if j e S'q f 1 -p2, if j e -Sc 

* \ — Pi, otherwise \ — P2, otherwise 



By definition of 5a, E[X,] Pr[i e 5a] (1 - pi) - Vr[i ^ Sa]pi = 0. Similarly, 
E[Fj] = 0. We have that {i,j) e Z' if and only if e Z and (z, j) e 5a x Sc- 
This means that z'pip2 = j)(zz{Xi~^P^)i'Yj~^P2)- By linearity of expectation, 
'E,[{Xi+pi)iYj+p2)] = P1P2, and we can write the variance of z'piP2, Var(z'pip2) 
as 



E 



J2 iiX,+Pi){Y, + P2) -P1P2) 



Expanding the product and using linearity of expectation, we get 

(j,j)ez(ij')ez ii,])ez ii',j)ez (i,j)ez 

ie^j.j'eZi. jeCi,i'ez.j 

Since E [X,^] = — pi)^ + (1 — = Pi - Pi < Pi, and similarly 

E [Yj^] < p2 we can upper bound Var(z') as follows: 

Var(z') = (pip2)"^ Var(z'piP2) 

<(PiP2)"^(X! E ^'i?'2 + E E + 

ieAjj'eZi. jeCi,i'ez.j 

< (PiP2)"^ (ncZplPa +nazplp2 + 

= ("c/Pl + na/p2 + iPlP2y^) Z . 

3.2 Sufficient sample size 

We are ready to derive a bound on the probability that z' deviates significantly 
from z. Choose < e < 1. Since z = E[z'] Chebyshev's inequality says 

Pr[|z' -z]> ez] < \ ' < {njpi + njp2 + (pi?32)"^) /{e^z). 

This can equivalently be expressed in terms of the sample size s, since pi = s/ni 
and P2 = s/n2: 

Pr[|z' — z\> ez] < {ricni + ri(,n2 + ?^lf^2/s) / (se^z). 

We seek a sufficient condition on s that the above probability is bounded by 
some constant 5 < ^ (e.g. S = 1/6). In particular it must be the case that 
nin2 / {s^ z) < S, which implies s > rii, n2/{Sz) > ^ rii, n2/ {SnaUc)- Hence, 
using the arithmetic-geometric inequality: 



nin2/s < \/ncninan2S < {12^711 + nan2) / (2^/5) . 
In other words, it suffices that 



z 



One apparent problem is the chicken-egg situation: z is not known in advance. 
If a lower bound on z is known, this can be used to compute a sufficient sample 
size. Alternatively, if we allow a larger relative error whenever z < ^ we may 
compute a sufficient value of s based on the assumption z > /3. Whenever z < /? 
we then get the guarantee that z' < (1 -I- e)(3 with probability 1 ~ S. Theoremjl] 
follows by fixing s and solving for /3. 



Optimality. For constant e and 6 our upper bound matches the lower bound of 
Ganguly et al. [S] whenever this does not exceed rti + 712. It is trivial to achieve 
a sketch of size 0{{ni + 712) ^og{ni + 712)) bits (simply store hash signatures for 
the entire relations) . We also note that the lower bound proof in [9_ uses certain 
restrictions of parameters (rti = n2, na = Uc, and z < Ua + Uc), so it may be 
possible to do better in some settings. 



4 Experiments 

We have run our algorithm on most of the datasets from the Frequent Itemset 
Mining Implementations (FIMI) Repositorjj^ together with some datasets ex- 
tracted from the Internet Movie Database (IMDB). Each dataset represents a 
single relation, and motivated by the Apriori space estimation example in the 
introduction, we perform the size estimation on self-joins of these relations. Ta- 
ble [1] displays the size of each dataset together with the number of distinct a- 
and c- values. 



Instance 


2 


ria (= Uc) 


eo.i 


eo.oi 


accidents 


94- 




468 


1 


18 


3 


73 


bms-pos 


760- 


10^ 


1,657 





78 


2 


47 


bms-webview- 1 


128 • 


10^ 


497 


1 


04 


3 


29 


bms-webview-2 


1.45 


10" 


3,340 





80 


2 


54 


chess 


5.24 


10^ 


75 


2 


00 


6 


33 


connect 


13.8 


10^ 


129 


1 


62 


5 


12 


directoractor 


734- 


10" 


50,645 





14 





44 


kosarak 


66.2 


10" 


41,270 





42 


1 


32 


movieactor 


111 • 


10" 


51,226 





36 


1 


14 


mushroom 


7.17 


10^ 


119 


2 


16 


6 


82 


pumsb 


1.07 


10" 


2,113 





74 


2 


35 


pumsb_star 


967- 


10=* 


2,088 





78 


2 


46 


retail 


7.19 


10" 


16,470 





80 


2 


53 



Table 1. Characteristics of the used datasets. The rightmost middle column displays 
the size = |7ra(-Ri)| (which in this case is equals = | U7rc(-R2)|). The two rightmost 
columns display the theoretical error as described in Theorem |4] for pi = p2 = 0.1 
and pi = P2 = 0.01, respectively. These theoretical error bounds, which hold with 
probability 5/6, are significantly larger than the actual observed errors in Figure [2] 



Rather than selecting hi and /12 from an arbitrary pairwise independent 
family, we store functions that map the attribute values to fully random and 
independent values of the form where d is a 64 bit random integer formed 

by reading 64 random bits from the Marsaglia Random Number CDROIVFJ 



^ http://fimi.cs.helsinki.fi 
http:/ /www. stat.fsu.edu/pub/diehard/ 




Ratio Ratio 



(a) k = 256 (b) k = 1024 

Fig. 1. The cumulative distribution functions for k = 256 and k = 1024. It is seen 
that k — 1024 yields a more precise estimate than k — 256 with 2/3 of the estimates 
being within 4% and 10% of the exact size, respectively. 




(a) k = 1024, pi = P2 = 0.1 (b) k = 1024, pi = pa = 0.01 

Fig. 2. Plots for sampling with probability 10% and 1%. If the sampling probability 
is too small, no elements at all may reach the sketch and in these cases we are not able 
to return an estimate. Instances with no estimates have been left out of the graph. 



We have chosen an initial value of p = 1 for our tests in order to be certain to 
always arrive at an estimate. In most cases we observed that p quickly decreases 
to a value below 1/k anyway. But as the sampling probability decreases, the 
probability that the sketch will never be filled increases, implying that we will 
not get a linear time complexity with an initial value of p = 1. In the cases where 
the sketch is not filled, we report \F\/{pip2) as the estimate, where |F| is the 
number of elements in the buffer. 

Tests have been performed for k ~ 256 and k = 1024. In each test, 60 inde- 
pendent estimates were made and compared to the exact size of the join-project. 
By sorting the ratios "estimate" /" exact size" we can draw the cumulative distri- 
bution function for each instance that, for each ratio- value on the x-axis, displays 
on the y-axis the probability that an estimate will have this ratio or less. Figurejl] 
shows plots for k = 256 and k — 1024. In Table [2] we compare the theoretical 
error e with observed error for 2/3 of the results. As seen, the observed error is 
smaller than the theoretical upper bound. 

In Figure [2] we perform sampling with 10% and 1% probability, as described 
in SectionjSj Again, the samples are chosen using truly random bits. The variance 
of estimates increase as the probability decreases, but increases more for smaller 
than for larger instances. If the sampling probability is too small, no elements 
at all may reach the sketch and in these cases we are not able to return an 
estimate. As seen, the observed errors in the figure are significantly smaller than 
the theoretical errors seen in Table [B 



k 


e 


Observed e 


256 


0.188 


0.1 


1024 


0.094 


0.04 



Table 2. The theoretical error bound is e = as stated Theorem|3| The observed 

error in Figure [ij however, is significantly less. 



5 Conclusion 

We have presented improved algorithms for estimating the size of boolean matrix 
products, for the first time allowing o(l) relative error to be achieved in linear 
time. An interesting open problem is if this can be extended to transitive closure 
in general graphs, and/or to products of more than two matrices. 
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